D² Log Loss Score (d2_log_loss_score)#
d2_log_loss_score is an R²-style metric for probabilistic classification.
Instead of asking “How accurate are the predicted labels?” it asks:
How much did my model reduce log loss compared to a null model that ignores features and only predicts class proportions?
Learning goals#
By the end you should be able to:
explain log loss as “surprise” / negative log-likelihood
define and interpret the D² log loss score
implement log_loss and d2_log_loss_score from scratch in NumPy
see (with plots) why overconfident wrong predictions are punished harshly
optimize a simple logistic regression with gradient descent and track D² during training
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import d2_log_loss_score, log_loss
from sklearn.model_selection import train_test_split
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)
Quick import#
from sklearn.metrics import d2_log_loss_score
d2_log_loss_score expects predicted probabilities (predict_proba output), not hard class labels.
Prerequisites#
binary vs multiclass classification
probabilities and logs
log loss / cross-entropy
If you know R² for regression:

\[
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},
\]

then D² is the same “fraction explained” idea, but with log loss instead of squared error.
1) Intuition: baseline vs your model#
Log loss as “surprise”#
For a single binary sample with true label \(y \in \{0, 1\}\) and predicted probability \(p = \Pr(y=1)\), the per-sample log loss is

\[
\ell(y, p) = -\bigl[\, y \log p + (1 - y) \log(1 - p) \,\bigr]
\]
If you say “99%” and you’re right, you get a tiny loss.
If you say “99%” and you’re wrong, you get a huge loss.
That’s why log loss is great when probabilities matter: it rewards calibration and penalizes overconfidence.
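A quick numeric illustration of those two sentences (the helper below is written just for this sketch):

```python
import numpy as np

def sample_log_loss(y: int, p: float) -> float:
    """Per-sample binary log loss: -log(p) if y=1, -log(1-p) if y=0."""
    return -np.log(p) if y == 1 else -np.log(1.0 - p)

confident_right = sample_log_loss(1, 0.99)  # tiny: about 0.01 nats
confident_wrong = sample_log_loss(0, 0.99)  # huge: about 4.6 nats
```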
D² turns log loss into an “explained fraction”#
D² compares your model to a null model that ignores features and always predicts the class proportions in the evaluation set.
If your model is no better than the null model: D² = 0
If your model is perfect: D² = 1
If your model is worse than the null model: D² < 0 (can be very negative)
So you can read D² as:
“How much of the baseline log loss did we eliminate?”
2) Definition (math)#
Let there be \(n\) samples and \(K\) classes.
True labels as one-hot indicators: \(y_{i,k} \in \{0,1\}\) with \(\sum_k y_{i,k} = 1\)
Predicted probabilities: \(p_{i,k}\) with \(\sum_k p_{i,k} = 1\)
Optional sample weights: \(w_i \ge 0\)
Log loss (sum form)#
Using the natural logarithm (units = nats):

\[
\operatorname{LL}(p) = -\sum_{i=1}^{n} w_i \sum_{k=1}^{K} y_{i,k} \log p_{i,k}
\]
(You can divide by \(\sum_i w_i\) to get a mean log loss; the D² ratio is unchanged.)
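A quick check of that claim, with made-up values: the \(1/\sum_i w_i\) factor appears in both numerator and denominator, so it cancels.

```python
import numpy as np
from sklearn.metrics import log_loss

y = np.array([0, 1, 1, 0])
p = np.array([0.2, 0.7, 0.9, 0.4])
p_null = np.full(4, y.mean())  # null model: constant positive rate

# D² computed from summed losses and from mean losses is identical.
d2_sum = 1 - log_loss(y, p, normalize=False) / log_loss(y, p_null, normalize=False)
d2_mean = 1 - log_loss(y, p, normalize=True) / log_loss(y, p_null, normalize=True)
```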
Null model#
The null model predicts constant probabilities equal to the weighted class proportions:

\[
p^{\text{null}}_{i,k} = \frac{\sum_{j=1}^{n} w_j \, y_{j,k}}{\sum_{j=1}^{n} w_j}
\quad \text{for every sample } i
\]
D² log loss score#

\[
D^2_{\log} = 1 - \frac{\operatorname{LL}(p)}{\operatorname{LL}(p^{\text{null}})}
\]
Properties:
Best possible score is 1.
0 means “no improvement over predicting class proportions”.
Can be negative and is unbounded below (log loss can blow up with confident wrong predictions).
Relationship to optimization#
For a fixed evaluation set, \(\operatorname{LL}(p^{\text{null}})\) is a constant.
So maximizing \(D^2_{\log}\) is equivalent to minimizing log loss:

\[
\arg\max_{\theta} D^2_{\log}(p_\theta) = \arg\min_{\theta} \operatorname{LL}(p_\theta)
\]
3) NumPy from-scratch implementation#
We’ll implement two functions:
log_loss_numpy: the (weighted) cross-entropy / negative log-likelihood
d2_log_loss_score_numpy: the “fraction of log loss explained” relative to the null model
Notes:
We clip probabilities to avoid log(0).
For binary classification we allow probabilities shaped (n_samples,) or (n_samples, 1) (positive-class probability) or (n_samples, 2).
import warnings
def sigmoid(z: np.ndarray) -> np.ndarray:
# Numerically stable sigmoid
z = np.asarray(z, dtype=float)
out = np.empty_like(z, dtype=float)
pos = z >= 0
out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
exp_z = np.exp(z[~pos])
out[~pos] = exp_z / (1.0 + exp_z)
return out
def softmax(z: np.ndarray, axis: int = 1) -> np.ndarray:
# Numerically stable softmax
z = np.asarray(z, dtype=float)
z = z - np.max(z, axis=axis, keepdims=True)
exp_z = np.exp(z)
return exp_z / np.sum(exp_z, axis=axis, keepdims=True)
def _as_float_array(a):
return np.asarray(a, dtype=float)
def _check_sample_weight(sample_weight, n_samples: int) -> np.ndarray:
if sample_weight is None:
return np.ones(n_samples, dtype=float)
w = _as_float_array(sample_weight)
if w.shape != (n_samples,):
raise ValueError(f"sample_weight must have shape ({n_samples},), got {w.shape}.")
if np.any(w < 0):
raise ValueError("sample_weight must be non-negative.")
return w
def _infer_classes(y_true, labels):
y_true = np.asarray(y_true)
if labels is None:
classes = np.unique(y_true)
if classes.size < 2:
raise ValueError(
f"y_true contains only one label ({classes[0]!r}). "
"Please provide the true labels explicitly through the labels argument."
)
return classes
classes = np.asarray(labels)
if classes.size < 2:
raise ValueError("labels needs to contain at least two classes.")
return classes
def _label_to_index(classes: np.ndarray) -> dict:
return {label: i for i, label in enumerate(classes.tolist())}
def log_loss_numpy(
y_true,
y_proba,
*,
normalize: bool = True,
sample_weight=None,
labels=None,
eps: float | None = None,
) -> float:
# NumPy log loss for 1D label arrays.
# Parameters mirror sklearn.metrics.log_loss (subset).
y_true = np.asarray(y_true)
y_proba = _as_float_array(y_proba)
n_samples = y_true.shape[0]
if y_proba.shape[0] != n_samples:
raise ValueError("y_true and y_proba must have the same number of samples.")
classes = _infer_classes(y_true, labels)
label_to_idx = _label_to_index(classes)
try:
y_idx = np.array([label_to_idx[y] for y in y_true.tolist()], dtype=int)
except KeyError as e:
raise ValueError(f"Found unknown label {e.args[0]!r} in y_true.") from None
# Shape handling: allow binary probs as (n,) or (n, 1)
if y_proba.ndim == 1:
if classes.size != 2:
raise ValueError("1D y_proba is only supported for binary classification.")
y_proba = np.column_stack([1.0 - y_proba, y_proba])
if y_proba.ndim != 2:
raise ValueError("y_proba must be 1D or 2D.")
if y_proba.shape[1] == 1:
if classes.size != 2:
raise ValueError("(n, 1) y_proba is only supported for binary classification.")
p = y_proba[:, 0]
y_proba = np.column_stack([1.0 - p, p])
if y_proba.shape[1] != classes.size:
raise ValueError(
f"Number of probability columns ({y_proba.shape[1]}) does not match number of classes ({classes.size})."
)
if eps is None:
eps = np.finfo(y_proba.dtype).eps
row_sums = y_proba.sum(axis=1)
if not np.allclose(row_sums, 1.0, rtol=np.sqrt(eps)):
warnings.warn(
"The y_proba rows do not sum to 1. Make sure to pass probabilities.",
UserWarning,
)
y_proba = np.clip(y_proba, eps, 1.0 - eps)
p_true = y_proba[np.arange(n_samples), y_idx]
per_sample_loss = -np.log(p_true)
w = _check_sample_weight(sample_weight, n_samples)
loss_sum = float(np.sum(w * per_sample_loss))
if normalize:
return loss_sum / float(np.sum(w))
return loss_sum
def d2_log_loss_score_numpy(y_true, y_proba, *, sample_weight=None, labels=None) -> float:
# D² log loss score (fraction of log loss explained).
y_true = np.asarray(y_true)
y_proba = _as_float_array(y_proba)
n_samples = y_true.shape[0]
if y_proba.shape[0] != n_samples:
raise ValueError("y_true and y_proba must have the same number of samples.")
if n_samples < 2:
warnings.warn(
"D^2 score is not well-defined with less than two samples.",
RuntimeWarning,
)
return float("nan")
numerator = log_loss_numpy(
y_true,
y_proba,
normalize=False,
sample_weight=sample_weight,
labels=labels,
)
classes = _infer_classes(y_true, labels)
label_to_idx = _label_to_index(classes)
y_idx = np.array([label_to_idx[y] for y in y_true.tolist()], dtype=int)
w = _check_sample_weight(sample_weight, n_samples)
counts = np.bincount(y_idx, weights=w, minlength=classes.size)
priors = counts / float(np.sum(w))
y_proba_null = np.tile(priors, (n_samples, 1))
denominator = log_loss_numpy(
y_true,
y_proba_null,
normalize=False,
sample_weight=sample_weight,
labels=labels,
)
return 1.0 - (numerator / denominator)
Sanity check vs scikit-learn#
We’ll compare our NumPy implementation to:
sklearn.metrics.log_loss
sklearn.metrics.d2_log_loss_score
On typical inputs (at least 2 classes present), the values should match up to tiny floating-point differences.
# Binary sanity check
n = 500
y_true_bin = rng.integers(0, 2, size=n)
# Random probabilities (not great, but valid)
p_pos = np.clip(rng.uniform(0.02, 0.98, size=n), 1e-15, 1 - 1e-15)
print('log_loss sklearn:', log_loss(y_true_bin, p_pos))
print('log_loss numpy :', log_loss_numpy(y_true_bin, p_pos))
print('d2 sklearn :', d2_log_loss_score(y_true_bin, p_pos))
print('d2 numpy :', d2_log_loss_score_numpy(y_true_bin, p_pos))
# Multiclass sanity check
n = 600
k = 3
y_true_mc = rng.integers(0, k, size=n)
logits = rng.normal(size=(n, k))
proba = softmax(logits, axis=1)
print()
print('multiclass log_loss sklearn:', log_loss(y_true_mc, proba, labels=[0, 1, 2]))
print('multiclass log_loss numpy :', log_loss_numpy(y_true_mc, proba, labels=[0, 1, 2]))
print('multiclass d2 sklearn :', d2_log_loss_score(y_true_mc, proba, labels=[0, 1, 2]))
print('multiclass d2 numpy :', d2_log_loss_score_numpy(y_true_mc, proba, labels=[0, 1, 2]))
log_loss sklearn: 0.9737837898808718
log_loss numpy : 0.9737837898808718
d2 sklearn : -0.4054570626616003
d2 numpy : -0.4054570626616003
multiclass log_loss sklearn: 1.3819229190361464
multiclass log_loss numpy : 1.3819229190361464
multiclass d2 sklearn : -0.258491560577216
multiclass d2 numpy : -0.258491560577216
4) Plots: how log loss and D² behave#
4.1) Per-sample log loss curves (binary)#
For \(y=1\) the loss is \(-\log(p)\). For \(y=0\) it is \(-\log(1-p)\).
These curves explain the “overconfident wrong is catastrophic” behavior.
p = np.linspace(1e-6, 1 - 1e-6, 600)
loss_y1 = -np.log(p)
loss_y0 = -np.log(1 - p)
fig = go.Figure()
fig.add_trace(go.Scatter(x=p, y=loss_y1, mode='lines', name='y=1: -log(p)'))
fig.add_trace(go.Scatter(x=p, y=loss_y0, mode='lines', name='y=0: -log(1-p)'))
fig.update_layout(
title='Binary log loss for a single sample',
xaxis_title='Predicted probability p = P(y=1)',
yaxis_title='Per-sample log loss (nats)',
width=900,
height=450,
)
fig.show()
4.2) The null-model log loss is (empirical) entropy#
In the binary case, the null model predicts the positive rate \(\hat{p}\) for everyone.
Its mean log loss is the empirical Bernoulli entropy:

\[
H(\hat{p}) = -\bigl[\, \hat{p} \log \hat{p} + (1 - \hat{p}) \log(1 - \hat{p}) \,\bigr]
\]
Maximum at \(\hat{p}=0.5\) (most uncertainty)
Small near 0 or 1 (almost-deterministic labels)
This matters because D² scales improvements by this baseline.
p_hat = np.linspace(1e-6, 1 - 1e-6, 600)
entropy = -(p_hat * np.log(p_hat) + (1 - p_hat) * np.log(1 - p_hat))
fig = go.Figure(go.Scatter(x=p_hat, y=entropy, mode='lines'))
fig.update_layout(
title='Baseline (null-model) mean log loss = Bernoulli entropy H(p̂)',
xaxis_title='Positive rate p̂',
yaxis_title='H(p̂) (nats)',
width=900,
height=450,
)
fig.show()
4.3) D² for constant predictors#
Let’s generate a dataset with a fixed class balance and score a constant predictor that always outputs \(q\).
At \(q = \hat{p}\) (the empirical positive rate), D² = 0 (that’s the null model).
Moving away from \(\hat{p}\) makes log loss worse → D² becomes negative.
This is a nice way to see D² as a scaled version of log loss.
n = 800
p_true = 0.25
y = rng.binomial(1, p_true, size=n)
q = np.linspace(1e-4, 1 - 1e-4, 500)
# Score constant predictors
logloss_vals = np.array([log_loss_numpy(y, np.full_like(y, qi, dtype=float)) for qi in q])
d2_vals = np.array([d2_log_loss_score_numpy(y, np.full_like(y, qi, dtype=float)) for qi in q])
p_hat = y.mean()
fig = make_subplots(rows=1, cols=2, subplot_titles=('Mean log loss', 'D² log loss score'))
fig.add_trace(go.Scatter(x=q, y=logloss_vals, mode='lines', name='log loss'), row=1, col=1)
fig.add_vline(x=p_hat, line_dash='dash', line_color='gray', row=1, col=1)
fig.update_xaxes(title_text='Constant probability q', row=1, col=1)
fig.update_yaxes(title_text='Mean log loss', row=1, col=1)
fig.add_trace(go.Scatter(x=q, y=d2_vals, mode='lines', name='D²'), row=1, col=2)
fig.add_hline(y=0.0, line_dash='dash', line_color='gray', row=1, col=2)
fig.add_vline(x=p_hat, line_dash='dash', line_color='gray', row=1, col=2)
fig.update_xaxes(title_text='Constant probability q', row=1, col=2)
fig.update_yaxes(title_text='D²', row=1, col=2)
fig.update_layout(
title=f'Constant predictors on a dataset with p̂={p_hat:.3f}',
width=1000,
height=430,
showlegend=False,
)
fig.show()
5) Using log loss (and D²) to optimize logistic regression (from scratch)#
Because D² is just a rescaling of log loss (by a constant baseline), training usually minimizes log loss directly.
Binary logistic regression#
Model with parameters \(\beta\):

\[
p_i = \sigma(x_i^\top \beta) = \frac{1}{1 + e^{-x_i^\top \beta}}
\]

Average log loss:

\[
\mathcal{L}(\beta) = -\frac{1}{n} \sum_{i=1}^{n} \bigl[\, y_i \log p_i + (1 - y_i) \log(1 - p_i) \,\bigr]
\]

Gradient (classic result):

\[
\nabla_\beta \mathcal{L}(\beta) = \frac{1}{n} X^\top (p - y)
\]
We’ll run gradient descent and track both:
log loss (lower is better)
D² log loss score (higher is better)
# Synthetic 2D classification problem
X, y = make_classification(
n_samples=1200,
n_features=2,
n_informative=2,
n_redundant=0,
n_clusters_per_class=1,
class_sep=1.4,
flip_y=0.06,
weights=[0.6, 0.4],
random_state=7,
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, random_state=7, stratify=y
)
# Standardize (train statistics only)
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0) + 1e-12
X_train_s = (X_train - mu) / sigma
X_test_s = (X_test - mu) / sigma
# Add intercept term
X_train_i = np.column_stack([np.ones(X_train_s.shape[0]), X_train_s])
X_test_i = np.column_stack([np.ones(X_test_s.shape[0]), X_test_s])
# Quick look at the data
fig = px.scatter(
x=X_train_s[:, 0],
y=X_train_s[:, 1],
color=y_train.astype(str),
title='Training data (standardized)',
labels={'x': 'x1 (standardized)', 'y': 'x2 (standardized)', 'color': 'class'},
)
fig.update_layout(width=750, height=450)
fig.show()
def fit_logreg_gd(
X: np.ndarray,
y: np.ndarray,
*,
lr: float = 0.2,
n_epochs: int = 250,
l2: float = 0.0,
):
# Gradient descent for binary logistic regression.
# X: (n, d) with intercept column already included.
# y: (n,) in {0,1}
n, d = X.shape
w = np.zeros(d, dtype=float)
history = {
'epoch': [],
'train_log_loss': [],
'train_d2': [],
'w': [],
}
for epoch in range(n_epochs):
p = sigmoid(X @ w)
# Mean log loss (for nicer learning-rate behavior)
loss = log_loss_numpy(y, p)
# Gradient of mean log loss + optional L2
grad = (X.T @ (p - y)) / n
grad[1:] += l2 * w[1:] # don't regularize intercept
w -= lr * grad
history['epoch'].append(epoch)
history['train_log_loss'].append(loss)
history['train_d2'].append(d2_log_loss_score_numpy(y, p))
history['w'].append(w.copy())
return w, history
w_hat, hist = fit_logreg_gd(X_train_i, y_train, lr=0.25, n_epochs=240, l2=0.0)
p_train = sigmoid(X_train_i @ w_hat)
p_test = sigmoid(X_test_i @ w_hat)
print('Final (train) log loss:', log_loss_numpy(y_train, p_train))
print('Final (test) log loss:', log_loss_numpy(y_test, p_test))
print('Final (train) D² :', d2_log_loss_score_numpy(y_train, p_train))
print('Final (test) D² :', d2_log_loss_score_numpy(y_test, p_test))
Final (train) log loss: 0.12075603565990804
Final (test) log loss: 0.07421418195747594
Final (train) D² : 0.8206929808323062
Final (test) D² : 0.8897282386094872
# Plot training curves
fig = make_subplots(rows=1, cols=2, subplot_titles=('Log loss (train)', 'D² log loss score (train)'))
fig.add_trace(
go.Scatter(x=hist['epoch'], y=hist['train_log_loss'], mode='lines', name='log loss'),
row=1,
col=1,
)
fig.update_xaxes(title_text='epoch', row=1, col=1)
fig.update_yaxes(title_text='mean log loss', row=1, col=1)
fig.add_trace(
go.Scatter(x=hist['epoch'], y=hist['train_d2'], mode='lines', name='D²'),
row=1,
col=2,
)
fig.add_hline(y=0.0, line_dash='dash', line_color='gray', row=1, col=2)
fig.update_xaxes(title_text='epoch', row=1, col=2)
fig.update_yaxes(title_text='D²', row=1, col=2)
fig.update_layout(width=1000, height=420, showlegend=False)
fig.show()
Compare with scikit-learn#
We’ll fit sklearn.linear_model.LogisticRegression and score the predicted probabilities.
Because D² is computed from log loss, you should typically see the same ranking as with log loss (higher D² ↔ lower log loss) on the same dataset.
clf = LogisticRegression(penalty=None, solver='lbfgs', max_iter=2000, random_state=7)
clf.fit(X_train_s, y_train)
p_test_sklearn = clf.predict_proba(X_test_s)[:, 1]
print('sklearn log loss:', log_loss(y_test, p_test_sklearn))
print('numpy log loss:', log_loss_numpy(y_test, p_test_sklearn))
print('sklearn D² :', d2_log_loss_score(y_test, p_test_sklearn))
print('numpy D² :', d2_log_loss_score_numpy(y_test, p_test_sklearn))
sklearn log loss: 0.06472355069849435
numpy log loss: 0.06472355069849435
sklearn D² : 0.9038299722408762
numpy D² : 0.9038299722408762
6) Pros, cons, and when to use it#
Pros#
Interpretable baseline: 0 means “no better than predicting class proportions”.
Proper-probability sensitivity (via log loss): rewards well-calibrated probabilities, not just correct labels.
Works for multiclass naturally.
R²-like scale: best is 1, worse-than-baseline is negative.
Cons / caveats#
Unbounded below: very confident wrong predictions can drive the score to large negative values.
Dataset-dependent scaling: the baseline log loss depends on class balance; comparing D² across very different datasets can be misleading.
Requires probabilities: hard labels (predict) aren’t enough.
Not meaningful with too few samples: scikit-learn returns NaN for n_samples < 2.
Edge cases with missing classes: if an evaluation fold is missing a class, you may need to pass labels= consistently.
Good use cases#
evaluating probabilistic classifiers where calibration matters (risk scores, medical, churn probability, etc.)
reporting an “explained fraction” style score to audiences familiar with R²
model selection within the same dataset / split (it ranks models the same way as log loss)
7) Pitfalls & diagnostics#
Overconfidence hurts: if your D² is very negative, inspect examples with high loss (often miscalibrated or shifted data).
Always check the null score: D²=0 is not “bad” — it means “you didn’t beat class proportions”.
Class imbalance: on highly imbalanced datasets, the null model can already achieve low log loss; large D² improvements may be harder.
Label ordering (binary with (n_samples,) probabilities): scikit-learn assumes the probability is for the “positive” class (the second class in sorted order).
Use calibration tools when needed: reliability diagrams, isotonic regression, Platt scaling.
Exercises#
Create two classifiers with the same accuracy but different probability calibration. Compare log loss and D².
For multiclass data, show how D² changes as you apply “temperature scaling” to logits.
Add L2 regularization to the gradient descent logistic regression and observe how it affects log loss and D².
References#
scikit-learn API: sklearn.metrics.d2_log_loss_score
scikit-learn user guide: “D² score for classification” (see the docs page linked from the API)
C.M. Bishop (2006), Pattern Recognition and Machine Learning (cross-entropy / logistic regression)